{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Weight Of Evidence encoding"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Binning create buckets of independent variables based on ranking methods. Binning helps us converting continuous variables into categorical ones.\n",
    "\n",
    "WOE binning Implements a binning of numeric variables and factors with respect to a dichotomous target variable.\n",
    "\n",
    "```\n",
    "bin_total = bin_positives + bin_negatives\n",
    "total_labels = total_positives + total_negatives\n",
    "bin_WOE = log((bin_positives / total_positives) / (bin_negatives / total_negatives))\n",
    "bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe\n",
    "```\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently we provide WOE encoding for vertically partitioned datasets.\n",
    "\n",
    "Let's first load a sample dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import secretflow as sf\n",
    "from secretflow.data.vertical import VDataFrame\n",
    "from secretflow.utils.simulation.datasets import load_linear"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-06-26 13:58:28,126\tINFO worker.py:1544 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8266 \u001b[39m\u001b[22m\n"
     ]
    }
   ],
   "source": [
    "\n",
    "sf.shutdown()\n",
    "sf.init(['alice', 'bob'], address='local')\n",
    "alice, bob = sf.PYU('alice'), sf.PYU('bob')\n",
    "spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "parts = {\n",
    "        bob: (1, 11),\n",
    "        alice: (11, 22),\n",
    "    }\n",
    "vdf = load_linear(parts=parts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_data = vdf['y']\n",
    "y = sf.reveal(label_data.partitions[alice].data).values"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we are ready to perform WOE binning and substitution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker'> with party alice.\n",
      "INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker'> with party bob.\n",
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m INFO:jax._src.lib.xla_bridge:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: \n",
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m INFO:jax._src.lib.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n",
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m INFO:jax._src.lib.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n",
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m INFO:jax._src.lib.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n",
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m INFO:jax._src.lib.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n",
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m WARNING:jax._src.lib.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2m\u001b[36m(_run pid=3833768)\u001b[0m [2023-06-26 13:58:34.369] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n",
      "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3839631)\u001b[0m 2023-06-26 13:58:34.779 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n",
      "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3839630)\u001b[0m 2023-06-26 13:58:34.869 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_substitution.VertWOESubstitutionPyuWorker'> with party bob.\n",
      "INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_substitution.VertWOESubstitutionPyuWorker'> with party alice.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           x11       x12       x13       x14       x15       x16       x17  \\\n",
      "0     0.241531 -0.705729 -0.020094 -0.486932  0.851992  0.035219 -0.796096   \n",
      "1    -0.402727  0.115744  0.468149 -0.697152  0.386395  0.712798  0.239583   \n",
      "2     0.872675 -0.559321  0.390246  0.000472  0.225594 -0.639674  0.279511   \n",
      "3    -0.644718 -0.409382  0.141747 -0.797517  0.314084 -0.802476  0.348878   \n",
      "4    -0.949669 -0.940787 -0.951708  0.187475  0.272346  0.124419  0.853226   \n",
      "...        ...       ...       ...       ...       ...       ...       ...   \n",
      "9995 -0.031331 -0.078700 -0.020636 -0.575713  0.210120 -0.288943 -0.262945   \n",
      "9996  0.047039  0.965614 -0.921435 -0.092970  0.205778  0.155392  0.922683   \n",
      "9997  0.269438 -0.115586  0.928880  0.430016  0.269042 -0.331772  0.520971   \n",
      "9998  0.999325  0.433372 -0.805999  0.311548  0.072405  0.973399 -0.123470   \n",
      "9999 -0.203443  0.772931 -0.146181 -0.195646  0.274590  0.803816 -0.312047   \n",
      "\n",
      "           x18       x19       x20  y  \n",
      "0     0.810261  0.048303  0.937679  1  \n",
      "1     0.312728  0.526637  0.589773  1  \n",
      "2     0.039087 -0.753417  0.516735  0  \n",
      "3    -0.855979  0.250944  0.979465  1  \n",
      "4    -0.238805  0.243109 -0.121446  1  \n",
      "...        ...       ...       ... ..  \n",
      "9995 -0.847253  0.069960  0.786748  1  \n",
      "9996 -0.502486 -0.076290 -0.604832  1  \n",
      "9997 -0.424209  0.434947  0.998955  1  \n",
      "9998  0.914291 -0.473056  0.616257  1  \n",
      "9999 -0.602927 -0.021368  0.885519  0  \n",
      "\n",
      "[10000 rows x 11 columns]\n",
      "            x1        x2        x3        x4        x5        x6        x7  \\\n",
      "0    -0.514226  0.730010 -0.730391  0.970483  0.038063 -0.800808 -0.082006   \n",
      "1    -0.725537  0.482244 -0.823223  0.202119  0.484290 -0.139781 -0.341216   \n",
      "2     0.608353 -0.071102 -0.775098 -0.391496  0.224127  0.082370 -0.341216   \n",
      "3    -0.686642  0.160470  0.914477 -0.269052  0.224127 -0.547841 -0.341216   \n",
      "4    -0.198111  0.212909  0.950474  0.775259 -0.590206 -0.840528 -0.742056   \n",
      "...        ...       ...       ...       ...       ...       ...       ...   \n",
      "9995 -0.367246 -0.296454  0.558596 -0.403504  0.542892  0.000142 -0.341216   \n",
      "9996  0.010913  0.629268 -0.384093 -0.552787  0.542892 -0.100838  0.071673   \n",
      "9997 -0.238097  0.904069 -0.344859 -0.687887 -0.103900  0.223052 -0.286633   \n",
      "9998  0.453686 -0.375173  0.899238  0.908135 -0.590206  0.524051  0.347251   \n",
      "9999 -0.776015 -0.772112  0.012110 -0.898067  0.182405 -0.500491  0.557853   \n",
      "\n",
      "            x8        x9       x10  \n",
      "0    -0.499206 -0.750112 -0.910640  \n",
      "1    -0.652901  0.438065  0.830206  \n",
      "2    -0.183506 -0.783842 -0.729929  \n",
      "3    -0.269405 -0.974268 -0.800515  \n",
      "4     0.800389  0.185542  0.183614  \n",
      "...        ...       ...       ...  \n",
      "9995 -0.470127 -0.247682 -0.552526  \n",
      "9996  0.592903 -0.577123 -0.811461  \n",
      "9997 -0.172245  0.713149 -0.184585  \n",
      "9998 -0.558997  0.610076 -0.862191  \n",
      "9999 -0.275658 -0.250420  0.518420  \n",
      "\n",
      "[10000 rows x 10 columns]\n"
     ]
    }
   ],
   "source": [
    "from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning\n",
    "from secretflow.preprocessing.binning.vert_woe_substitution import VertWOESubstitution\n",
    "\n",
    "binning = VertWoeBinning(spu)\n",
    "woe_rules = binning.binning(\n",
    "    vdf,\n",
    "    binning_method=\"chimerge\",\n",
    "    bin_num=4,\n",
    "    bin_names={alice: [], bob: [\"x5\", \"x7\"]},\n",
    "    label_name=\"y\",\n",
    ")\n",
    "\n",
    "woe_sub = VertWOESubstitution()\n",
    "vdf = woe_sub.substitution(vdf, woe_rules)\n",
    "\n",
    "# this is for demo only, be careful with reveal\n",
    "print(sf.reveal(vdf.partitions[alice].data))\n",
    "print(sf.reveal(vdf.partitions[bob].data))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes we may need the iv values. Releasing bin ivs will potentially leak label information according to issue https://github.com/secretflow/secretflow/issues/565.\n",
    "Currently, we choose to save bin iv values in label holders device. It is up to label holder's choice to\n",
    "\n",
    "1. share no iv information\n",
    "2. share some chosen iv information\n",
    "\n",
    "We will demonstrate how to share the feature ivs."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that the woe_rules is a dictionary `{PYU: PYUObject}`, where each `PYUObject` itself is a dictionary of the following type:\n",
    "```\n",
    "{\n",
    "    \"variables\":[\n",
    "        {\n",
    "            \"name\": str, # feature name\n",
    "            \"type\": str, # \"string\" or \"numeric\", if feature is discrete or continuous\n",
    "            \"categories\": list[str], # categories for discrete feature\n",
    "            \"split_points\": list[float], # left-open right-close split points\n",
    "            \"total_counts\": list[int], # total samples count in each bins.\n",
    "            \"else_counts\": int, # np.nan samples count\n",
    "            \"woes\": list[float], # woe values for each bins.\n",
    "            \"else_woe\": float, # woe value for np.nan samples.\n",
    "        },\n",
    "        # ... others feature\n",
    "    ],\n",
    "    # label holder's PYUObject only\n",
    "    # warning: giving bin_ivs to other party will leak positive samples in each bin.\n",
    "    # it is up to label holder's will to give feature iv or bin ivs or all info to workers.\n",
    "    # for more information, look at: https://github.com/secretflow/secretflow/issues/565\n",
    "\n",
    "    # in the following comment, by safe we mean label distribution info is not leaked.\n",
    "    \"feature_iv_info\" :[\n",
    "        {\n",
    "            \"name\": str, #feature name\n",
    "            \"ivs\": list[float], #iv values for each bins, not safe to share with workers in any case.\n",
    "            \"else_iv\": float, #iv for nan values, may share to with workers\n",
    "            \"feature_iv\": float, #sum of bin_ivs, safe to share with workers when bin num > 2.\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# alice is label holder\n",
    "dict_pyu_object = woe_rules[alice]\n",
    "\n",
    "def extract_name_and_feature_iv(list_of_feature_iv_info):\n",
    "    return [(d[\"name\"], d[\"feature_iv\"]) for d in list_of_feature_iv_info]\n",
    "\n",
    "feature_ivs = alice(lambda dict_pyu_object: extract_name_and_feature_iv(dict_pyu_object[\"feature_iv_info\"]))(dict_pyu_object)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('x5', 0.37848298069087766), ('x7', 0)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we can give the feature_ivs to bob\n",
    "feature_ivs.to(bob)\n",
    "# and/or we can reveal it to see it\n",
    "sf.reveal(feature_ivs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Congradulations!\n",
    "In this tutorial we have learnt how to\n",
    "\n",
    "1. do WOE encoding\n",
    "2. share some iv information to other parties\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py38",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.15"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
